A Note on a Characterization of Renyi Measures 
and its Relation to Composite Hypothesis Testing 
Abstract: The Renyi information measures are characterized in terms of their Shannon counterparts, and properties of the former are recovered from first principles via the associated properties of the latter. Motivated by this characterization, a two-sensor composite hypothesis testing problem is presented, and the optimal worst-case miss-detection exponent is obtained in terms of a Renyi divergence.



I. Introduction 

The Shannon entropy and the Kullback-Leibler divergence play a pivotal role in the study of information theory, large deviations and statistics, arising as the answer to many of the fundamental questions in these fields. Besides their operational importance, these quantities also possess some very natural properties one would expect an information measure to satisfy, a fact that has spurred several different axiomatic characterizations; see [1] and references therein.

Motivated by the axiomatic approach, Renyi suggested a more general class of measures satisfying some slightly weaker postulates, yet still intuitively appealing as measures of information [2]. Remarkably, this "reversed" line of thought has proved fruitful; the Renyi information measures have been shown to admit several operational interpretations, thereby "justifying" their definition. Among other cases, the Renyi entropy has appeared as a fundamental quantity in problems of source coding with exponential weights [3], random search [4], error exponents in source coding [5], generalized cutoff rates for source coding [6], guessing moments [7], privacy amplification [8], predictive channel coding with transmitter side information [9], and redundancy-delay exponents in source coding [10]. The Renyi divergence has emerged (sometimes implicitly) in the analysis of channel coding error exponents [11], [12], generalized cutoff rates for hypothesis testing [6], multiple source adaptation [13], and generalized guessing moments [14]. Several different definitions of a Renyi mutual information (and the associated capacity) were tied to generalized cutoff rates in channel coding [15], [6], and to distortion in joint source-channel coding [16].

Interestingly, even though the Shannon measures are a special case of the Renyi measures, the latter can admit a variational characterization in terms of the former. For the Renyi entropy (of order $\alpha < 1$) this has been observed in the context of guessing moments [7], [17], and for one definition of a Renyi mutual information, has been derived in the context of generalized cutoff rates in channel coding [6, Appendix].
In this note, relations of that type and their applications^1 are further examined. Section II contains the necessary mathematical background. In Section III, a variational characterization for the various Renyi measures via the Shannon measures is provided. In Section IV, it is demonstrated how properties of the Renyi measures can be derived in a very instructive (and sometimes simpler) fashion directly from their variational characterization, via the associated properties of their Shannon counterparts. Finally, the discussed characterization motivates the study of a two-sensor composite hypothesis testing problem in which the Renyi divergence is shown to play a fundamental role, yielding a new operational interpretation of that quantity. This observation is discussed in Section V.

II. Preliminaries 
A. Shannon Information Measures 

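Let $\mathcal{X}$ be a finite alphabet, and denote by $\mathcal{P}(\mathcal{X})$ the set of all probability distributions over $\mathcal{X}$. The support of a distribution $P \in \mathcal{P}(\mathcal{X})$ is the set $S(P) = \{x \in \mathcal{X} : P(x) > 0\}$. The (Shannon) entropy of $P \in \mathcal{P}(\mathcal{X})$ is^2

$$H(P) \stackrel{\mathrm{def}}{=} -\sum_{x\in\mathcal{X}} P(x)\log P(x).$$

The (Kullback-Leibler) divergence between two distributions $P_1, P_2 \in \mathcal{P}(\mathcal{X})$ is

$$D(P_1\|P_2) \stackrel{\mathrm{def}}{=} \sum_{x\in\mathcal{X}} P_1(x)\log\frac{P_1(x)}{P_2(x)}.$$

We write $P_1 \ll P_2$ to indicate that $S(P_1) \subseteq S(P_2)$. Note that $D(P_1\|P_2) < \infty$ if and only if $P_1 \ll P_2$.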
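Let $\mathcal{X}, \mathcal{Y}$ be two finite alphabets. A channel $W : \mathcal{X} \to \mathcal{Y}$ is a set of probability distributions $\{W(\cdot|x) \in \mathcal{P}(\mathcal{Y})\}_{x\in\mathcal{X}}$ that maps a distribution $P \in \mathcal{P}(\mathcal{X})$ to the distributions $P \circ W \in \mathcal{P}(\mathcal{X}\times\mathcal{Y})$ and $PW \in \mathcal{P}(\mathcal{Y})$, according to

$$(P \circ W)(x,y) \stackrel{\mathrm{def}}{=} P(x)W(y|x), \qquad PW(y) \stackrel{\mathrm{def}}{=} \sum_{x\in\mathcal{X}} P(x)W(y|x).$$

For any two channels $V : \mathcal{X} \to \mathcal{Y}$, $W : \mathcal{X} \to \mathcal{Y}$, we write

$$D(V\|W|P) \stackrel{\mathrm{def}}{=} \sum_{x\in\mathcal{X}} P(x)\, D\big(V(\cdot|x)\,\|\,W(\cdot|x)\big).$$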

^1 In fact, the impetus for this short study grew out of a recent work by the author and colleagues [10], where the characterization for the Renyi entropy of order 2 has been utilized to obtain a lower bound on the redundancy-delay exponent in lossless source coding.

^2 We use the conventions $0\log 0 = 0$, and $a\log\frac{a}{0} = 0$ or $+\infty$ according to whether $a = 0$ or $a > 0$ respectively.



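The (Shannon) mutual information associated with $P$ and $W$ is

$$\begin{aligned}
I(P,W) &\stackrel{\mathrm{def}}{=} H(PW) - \sum_{x\in\mathcal{X}} P(x)\,H\big(W(\cdot|x)\big) \\
&= \min_{Q\in\mathcal{P}(\mathcal{Y})} \sum_{x\in\mathcal{X}} P(x)\,D\big(W(\cdot|x)\,\|\,Q\big) \qquad (1) \\
&= \min_{Q\in\mathcal{P}(\mathcal{Y})} D(P\circ W\,\|\,P\times Q) \qquad (2)
\end{aligned}$$

where the identities are well known. The (Shannon) capacity of a channel $W$ is

$$C(W) \stackrel{\mathrm{def}}{=} \max_{P} I(P,W).$$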
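The definitions of this subsection can be sketched numerically as follows. This is a minimal illustration rather than part of the original note: the function names and the example channel (a binary symmetric channel with crossover probability 0.1) are assumptions made here, and logarithms are taken to base 2. It checks that the objective in (1) evaluated at $Q = PW$ reproduces $I(P,W)$, and that another choice of $Q$ gives a larger value.

```python
import math

def entropy(p):
    """Shannon entropy in bits, using the convention 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """D(p||q) in bits; infinite unless the support of p lies inside that of q."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log2(a / b)
    return total

def output_dist(p, w):
    """PW(y) = sum_x P(x) W(y|x), with w[x][y] = W(y|x)."""
    return [sum(p[x] * w[x][y] for x in range(len(p))) for y in range(len(w[0]))]

def mutual_information(p, w):
    """I(P, W) = H(PW) - sum_x P(x) H(W(.|x))."""
    return entropy(output_dist(p, w)) - sum(px * entropy(row) for px, row in zip(p, w))

def objective_eq1(p, w, q):
    """The quantity minimized over Q in (1)."""
    return sum(px * kl(row, q) for px, row in zip(p, w))

# Hypothetical example: binary symmetric channel, crossover probability 0.1.
P = [0.3, 0.7]
W = [[0.9, 0.1], [0.1, 0.9]]
I_PW = mutual_information(P, W)
at_minimizer = objective_eq1(P, W, output_dist(P, W))  # Q = PW
at_uniform = objective_eq1(P, W, [0.5, 0.5])           # some other Q
```

Here `at_minimizer` agrees with `I_PW`, while `at_uniform` exceeds it by $D(PW\|Q)$ for the uniform $Q$.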

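A distribution $P \in \mathcal{P}(\mathcal{X})$ induces a product distribution $P^n \in \mathcal{P}(\mathcal{X}^n)$, where $P^n(x^n) = \prod_{k=1}^{n} P(x_k)$. The type of a sequence $x^n \in \mathcal{X}^n$ is the distribution $\pi_{x^n} \in \mathcal{P}(\mathcal{X})$ corresponding to the relative frequency of symbols in $x^n$. The set of all possible types of sequences $x^n$ is denoted $\mathcal{P}^n(\mathcal{X})$. The type class of any type $Q \in \mathcal{P}^n(\mathcal{X})$ is the set $T_Q = \{x^n \in \mathcal{X}^n : \pi_{x^n} = Q\}$.

The following facts are well known [18].

Lemma 1: For any type $Q \in \mathcal{P}^n(\mathcal{X})$ and any $x^n \in T_Q$:

(i) $P^n(x^n) = 2^{-n(D(Q\|P) + H(Q))}$.

(ii) $|\mathcal{P}^n(\mathcal{X})|^{-1}\, 2^{nH(Q)} \le |T_Q| \le 2^{nH(Q)}$.

(iii) $|\mathcal{P}^n(\mathcal{X})| = \binom{n+|\mathcal{X}|-1}{|\mathcal{X}|-1} \le (n+1)^{|\mathcal{X}|}$.

(iv) For any $\delta > 0$,

$$P^n\big(\{x^n \in \mathcal{X}^n : D(\pi_{x^n}\|P) > \delta\}\big) \le |\mathcal{P}^n(\mathcal{X})|\, 2^{-n\delta}.$$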
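The counting facts in Lemma 1 can be checked on a small example. The sketch below is illustrative only; the alphabet, the sequence and the helper names are assumptions chosen here. It computes the type of a short sequence, the exact size of its type class (a multinomial coefficient), and the number of types, and compares them with the bounds in (ii) and (iii).

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy in bits, using the convention 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def type_of(seq, alphabet):
    """Relative frequency of each alphabet symbol in seq."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def type_class_size(counts):
    """|T_Q| equals the multinomial coefficient n! / prod_a (n Q(a))!."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)
    return size

# Hypothetical example: a length-8 sequence over a ternary alphabet.
alphabet = "abc"
seq = "aabacbaa"
n = len(seq)
Q = type_of(seq, alphabet)
size = type_class_size([seq.count(a) for a in alphabet])
num_types = math.comb(n + len(alphabet) - 1, len(alphabet) - 1)  # property (iii)
upper = 2 ** (n * entropy(Q))                                     # property (ii)
lower = upper / num_types
```

For this sequence the type class size lies between `lower` and `upper`, and the number of types is below $(n+1)^{|\mathcal{X}|}$, as the lemma asserts.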

B. Renyi Information Measures 

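Let $\alpha > 0$, $\alpha \neq 1$ throughout. The Renyi entropy of order $\alpha$ of a distribution $P \in \mathcal{P}(\mathcal{X})$ is

$$H_\alpha(P) \stackrel{\mathrm{def}}{=} \frac{1}{1-\alpha}\log\sum_{x\in\mathcal{X}} P(x)^\alpha.$$

We denote by $H_0(P)$, $H_1(P)$ and $H_\infty(P)$ the limits of $H_\alpha(P)$ as $\alpha$ tends to 0, 1 and $\infty$, respectively^3. The Renyi divergence of order $\alpha$ between two distributions $P_1, P_2 \in \mathcal{P}(\mathcal{X})$ is^4

$$D_\alpha(P_1\|P_2) \stackrel{\mathrm{def}}{=} \frac{1}{\alpha-1}\log\sum_{x\in\mathcal{X}} P_1(x)^\alpha P_2(x)^{1-\alpha}.$$

We denote by $D_0(P_1\|P_2)$, $D_1(P_1\|P_2)$ and $D_\infty(P_1\|P_2)$ the limits of $D_\alpha(P_1\|P_2)$ as $\alpha$ tends to 0, 1 and $\infty$, respectively^3. Note that for $\alpha < 1$, $D_\alpha(P_1\|P_2) < \infty$ if and only if $S(P_1) \cap S(P_2) \neq \emptyset$, and for $\alpha > 1$, $D_\alpha(P_1\|P_2) < \infty$ if and only if $P_1 \ll P_2$.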
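The two definitions above, together with the stated finiteness conditions, can be written down directly. The following is a minimal sketch under the assumption of base-2 logarithms; the function names and example distributions are hypothetical and not taken from the note.

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(P) = (1 / (1 - alpha)) log2 sum_x P(x)^alpha, for alpha > 0, alpha != 1."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)

def renyi_divergence(p1, p2, alpha):
    """D_alpha(P1||P2) = (1 / (alpha - 1)) log2 sum_x P1(x)^alpha P2(x)^(1 - alpha).

    For alpha > 1 the result is infinite unless S(P1) lies inside S(P2);
    for alpha < 1 it is infinite only when the supports are disjoint.
    """
    total = 0.0
    for a, b in zip(p1, p2):
        if a == 0:
            continue              # such terms contribute zero
        if b == 0:
            if alpha > 1:
                return math.inf   # a^alpha * 0^(1 - alpha) = +inf for a > 0
            continue              # for alpha < 1 the term is zero
        total += a ** alpha * b ** (1 - alpha)
    if total == 0:
        return math.inf           # disjoint supports, alpha < 1
    return math.log2(total) / (alpha - 1)

# Hypothetical example distributions.
P1 = [0.5, 0.3, 0.2]
P2 = [0.2, 0.5, 0.3]
shannon_entropy = -sum(x * math.log2(x) for x in P1)
near_one = renyi_entropy(P1, 1.000001)             # close to the Shannon entropy
d_half = renyi_divergence(P1, P2, 0.5)
d_inf = renyi_divergence([0.5, 0.5], [1.0, 0.0], 2.0)   # P1 not dominated by P2
d_fin = renyi_divergence([0.5, 0.5], [1.0, 0.0], 0.5)   # supports overlap
```

The last two values illustrate the finiteness conditions: the order-2 divergence is infinite because $P_1 \not\ll P_2$, whereas the order-1/2 divergence is finite because the supports intersect.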

The Renyi equivalent of the Shannon mutual information has several different definitions, each generalizing a different expansion of the latter, see [6] and references therein. Here we discuss the following two alternatives:

$$I_\alpha(P,W) \stackrel{\mathrm{def}}{=} \min_{Q} \sum_{x\in\mathcal{X}} P(x)\,D_\alpha\big(W(\cdot|x)\,\|\,Q\big) \qquad (3)$$

corresponding to (1), and

$$K_\alpha(P,W) \stackrel{\mathrm{def}}{=} \min_{Q} D_\alpha(P\circ W\,\|\,P\times Q) \qquad (4)$$

corresponding to (2). Following [6], we define the capacity of order $\alpha$ of $W$ via (3), i.e.,

$$C_\alpha(W) \stackrel{\mathrm{def}}{=} \max_{P} I_\alpha(P,W).$$

However, using (4) in the definition yields the same capacity function [6], a fact we reaffirm in the sequel.

^3 These limits are known to exist, a fact we reestablish in the sequel.

^4 For $\alpha > 1$ we adopt the convention where $a^\alpha \cdot 0^{1-\alpha} = 0$ or $+\infty$ according to whether $a = 0$ or $a > 0$ respectively.

III. Characterization 

In this section, we derive the basic characterization for the various Renyi measures in terms of the Shannon measures.

Theorem 1: For $\alpha > 1$,

$$H_\alpha(P) = \min_{Q} \left\{ \frac{\alpha}{\alpha-1} D(Q\|P) + H(Q) \right\} \qquad (5)$$

$$D_\alpha(P_1\|P_2) = \max_{Q \ll P_1} \left\{ \frac{\alpha}{1-\alpha} D(Q\|P_1) + D(Q\|P_2) \right\} \qquad (6)$$

$$I_\alpha(P,W) = \max_{V} \left\{ I(P,V) + \frac{\alpha}{1-\alpha} D(V\|W|P) \right\} \qquad (7)$$

$$K_\alpha(P,W) = \max_{Q} \left\{ I_\alpha(Q,W) + \frac{1}{1-\alpha} D(Q\|P) \right\} \qquad (8)$$

For $\alpha < 1$, replace min with max and vice versa.

Remark 1: The $\alpha < 1$ counterpart of (5) is mentioned in [7], [17]. Both (5) and (6) are simple generalizations, for which we provide an elementary proof. Relation (7) can be found in [6, Appendix]; here, however, we provide a slightly different proof directly via (6). Relation (8) appears to be new.

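Proof: Let $\mathcal{X}_1 = S(P_1)$ and $\mathcal{X}_2 = S(P_2)$ for short. We derive a characterization for the functional

$$J_{\alpha,\beta}(P_1,P_2) \stackrel{\mathrm{def}}{=} -\log \sum_{x\in\mathcal{X}_1} P_1(x)^\alpha P_2(x)^\beta \qquad (9)$$

for any $\alpha > 0$ and $\beta$. This will yield (5) and (6) in particular, and will also prove useful in the sequel. It is readily verified that the functional is additive, i.e., $J_{\alpha,\beta}(P_1^n, P_2^n) = n J_{\alpha,\beta}(P_1,P_2)$. Therefore,

$$\begin{aligned}
J_{\alpha,\beta}(P_1,P_2) &= -\frac{1}{n}\log \sum_{x^n\in\mathcal{X}_1^n} P_1^n(x^n)^\alpha P_2^n(x^n)^\beta \\
&\le -\frac{1}{n}\log \sum_{Q\in\mathcal{P}^n(\mathcal{X}_1)} 2^{-n\left(\alpha(D(Q\|P_1)+H(Q)) + \beta(D(Q\|P_2)+H(Q))\right)} \times |\mathcal{P}^n(\mathcal{X}_1)|^{-1}\, 2^{nH(Q)} \\
&\le \min_{Q\in\mathcal{P}^n(\mathcal{X}_1)} \big\{ \alpha D(Q\|P_1) + \beta D(Q\|P_2) + (\alpha+\beta-1)H(Q) \big\} + \frac{|\mathcal{X}_1|\log(n+1)}{n}
\end{aligned}$$

where properties (i) and (ii) of Lemma 1 were used in the first inequality, and property (iii) was used in the second inequality. Similarly,

$$\begin{aligned}
J_{\alpha,\beta}(P_1,P_2) &\ge -\frac{1}{n}\log \sum_{Q\in\mathcal{P}^n(\mathcal{X}_1)} 2^{-n\left(\alpha D(Q\|P_1) + \beta D(Q\|P_2) + (\alpha+\beta-1)H(Q)\right)} \\
&\ge \min_{Q\in\mathcal{P}^n(\mathcal{X}_1)} \big\{ \alpha D(Q\|P_1) + \beta D(Q\|P_2) + (\alpha+\beta-1)H(Q) \big\} - \frac{|\mathcal{X}_1|\log(n+1)}{n}.
\end{aligned}$$

$\bigcup_n \mathcal{P}^n(\mathcal{X}_1)$ is dense in $\mathcal{P}(\mathcal{X}_1)$, and the objective function is continuous in $Q$ over the compact set $\mathcal{P}(\mathcal{X}_1\cap\mathcal{X}_2)$, and equals $\pm\infty$ over $\mathcal{P}(\mathcal{X}_1)\setminus\mathcal{P}(\mathcal{X}_1\cap\mathcal{X}_2)$ according to $\mathrm{sign}(\beta)$. Thus, taking the limit as $n\to\infty$, we obtain:

$$J_{\alpha,\beta}(P_1,P_2) = \min_{Q\in\mathcal{P}(\mathcal{X}_1)} \big\{ \alpha D(Q\|P_1) + \beta D(Q\|P_2) + (\alpha+\beta-1)H(Q) \big\}. \qquad (10)$$

The statement for $H_\alpha(P)$ (resp. $D_\alpha(P_1\|P_2)$) now follows by substituting $\beta = 0$ (resp. $\beta = 1-\alpha$), normalizing by $\alpha-1$ (resp. $1-\alpha$), and noting the possible change in sign that replaces min with max. For $H_\alpha(P)$, taking the min or max over all $Q \in \mathcal{P}(\mathcal{X})$ does not change anything.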
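Relation (10) lends itself to a quick numerical sanity check. The sketch below is not from the note and assumes full-support distributions, base-2 logarithms, and hypothetical helper names. It uses the elementary observation that, in the full-support case, the objective in (10) equals $D(Q\|Q^*) + J_{\alpha,\beta}(P_1,P_2)$ for the tilted distribution $Q^* \propto P_1^\alpha P_2^\beta$, so the minimum is attained at $Q^*$; randomly drawn $Q$ should never fall below $J_{\alpha,\beta}$.

```python
import math
import random

def entropy(q):
    return -sum(x * math.log2(x) for x in q if x > 0)

def kl(q, p):
    """D(q||p) in bits, assuming p has full support."""
    return sum(a * math.log2(a / b) for a, b in zip(q, p) if a > 0)

def J(p1, p2, alpha, beta):
    """The functional defined in (9)."""
    return -math.log2(sum(a ** alpha * b ** beta for a, b in zip(p1, p2)))

def objective(q, p1, p2, alpha, beta):
    """The quantity minimized over Q in (10)."""
    return alpha * kl(q, p1) + beta * kl(q, p2) + (alpha + beta - 1) * entropy(q)

def tilted(p1, p2, alpha, beta):
    """Normalized version of P1^alpha * P2^beta (candidate minimizer)."""
    w = [a ** alpha * b ** beta for a, b in zip(p1, p2)]
    z = sum(w)
    return [x / z for x in w]

def random_dist(k):
    w = [random.random() + 1e-3 for _ in range(k)]
    z = sum(w)
    return [x / z for x in w]

random.seed(0)
P1 = [0.5, 0.3, 0.2]   # hypothetical full-support examples
P2 = [0.2, 0.5, 0.3]
gaps = []
for alpha, beta in [(2.0, -0.5), (0.5, 0.5), (1.0, -1.0)]:
    target = J(P1, P2, alpha, beta)
    attained = objective(tilted(P1, P2, alpha, beta), P1, P2, alpha, beta)
    best_random = min(objective(random_dist(3), P1, P2, alpha, beta) for _ in range(200))
    gaps.append((abs(attained - target), best_random - target))
```

Each entry of `gaps` holds the error at the tilted distribution (numerically zero) and the slack of the best random candidate (non-negative).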

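We now turn to prove (7) and (8). As in [6], the minimum in (3) and (4) can be replaced with an infimum over distributions $Q$ with $S(Q) = \mathcal{Y}$, merely excluding possibly infinite values. This will be implicit below. For $\alpha > 1$, we have

$$\begin{aligned}
I_\alpha(P,W) &\stackrel{(a)}{=} \inf_{Q} \sum_{x\in\mathcal{X}} P(x) \max_{R} \left( \frac{\alpha}{1-\alpha} D\big(R\,\|\,W(\cdot|x)\big) + D(R\|Q) \right) \\
&= \inf_{Q} \max_{V} \sum_{x\in\mathcal{X}} P(x) \left( \frac{\alpha}{1-\alpha} D\big(V(\cdot|x)\,\|\,W(\cdot|x)\big) + D\big(V(\cdot|x)\,\|\,Q\big) \right) \\
&\stackrel{(b)}{=} \max_{V} \inf_{Q} \left( \frac{\alpha}{1-\alpha} D(V\|W|P) + \sum_{x\in\mathcal{X}} P(x)\, D\big(V(\cdot|x)\,\|\,Q\big) \right) \\
&\stackrel{(c)}{=} \max_{V} \left\{ I(P,V) + \frac{\alpha}{1-\alpha} D(V\|W|P) \right\}. \qquad (11)
\end{aligned}$$

The maximization is taken over all channels $V$ such that $P\circ V \ll P\circ W$. The equalities above are justified as follows:

(a) by virtue of Theorem 1.

(b) the objective function is continuous and concave^5 in $V$ over a compact set for any fixed $Q$, and convex in $Q$ for any fixed $V$. Hence, max and inf can be interchanged [19, Theorem 4.2].

(c) on account of (1).

This establishes (7) for $\alpha > 1$.^6 The simpler derivation for $\alpha < 1$ is similar.

^5 Concavity in $V$ follows by writing each of the summands as $\big[D(V(\cdot|x)\|Q) - D(V(\cdot|x)\|W(\cdot|x))\big] + \frac{1}{1-\alpha} D(V(\cdot|x)\|W(\cdot|x))$, which is the sum of a linear function and a concave function in $V$ (for $\alpha > 1$).

^6 Taking the last max over all channels $V : \mathcal{X} \to \mathcal{Y}$ changes nothing.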



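To establish (8), write:

$$\begin{aligned}
K_\alpha(P,W) &\stackrel{(a)}{=} \inf_{Q} \max_{P'\circ V} \left\{ \frac{\alpha}{1-\alpha} D(P'\circ V\,\|\,P\circ W) + D(P'\circ V\,\|\,P\times Q) \right\} \\
&\stackrel{(b)}{=} \max_{P'\circ V} \inf_{Q} \left\{ \frac{\alpha}{1-\alpha} D(P'\circ V\,\|\,P\circ W) + D(P'\circ V\,\|\,P\times Q) \right\} \\
&= \max_{P'\circ V} \inf_{Q} \left\{ \frac{\alpha}{1-\alpha} D(P'\circ V\,\|\,P\circ W) + D(P'\|P) + D(P'\circ V\,\|\,P'\times Q) \right\} \\
&\stackrel{(c)}{=} \max_{P'\circ V} \left\{ \frac{\alpha}{1-\alpha} D(P'\circ V\,\|\,P\circ W) + D(P'\|P) + I(P',V) \right\} \\
&= \max_{P'\circ V} \left\{ \frac{\alpha}{1-\alpha} D(V\|W|P') + \frac{1}{1-\alpha} D(P'\|P) + I(P',V) \right\} \\
&\stackrel{(d)}{=} \max_{P'} \left\{ I_\alpha(P',W) + \frac{1}{1-\alpha} D(P'\|P) \right\}. \qquad (12)
\end{aligned}$$

The maximization is over all $P'$ and $V$ such that $P'\circ V \ll P\circ W$. Equalities (a) and (b) are justified similarly to their counterparts in (11), while (c) and (d) follow from (2) and (7) respectively. This establishes (8) for $\alpha > 1$.^7 The simpler derivation for $\alpha < 1$ is similar. ■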

IV. Properties Revisited 

In this section, we derive some well known and lesser known properties of the Renyi measures directly via the characterization in Theorem 1 and the associated properties of the Shannon measures. These alternative derivations appear in many cases more instructive than a direct proof, and are sometimes simpler.

A. $H_\alpha(P)$

For convenience, define:

$$G_\alpha(P;Q) \stackrel{\mathrm{def}}{=} \frac{\alpha}{\alpha-1} D(Q\|P) + H(Q).$$

We will repeatedly use the fact that by Theorem 1, $G_\alpha(P;Q)$ is an upper (resp. lower) bound for $H_\alpha(P)$ for $\alpha > 1$ (resp. $\alpha < 1$). Without loss of generality, we will restrict $Q \ll P$ in Theorem 1 throughout.

1. $H_\alpha(P)$ is a non-increasing function of $\alpha$.

Proof: For any fixed $Q$, $G_\alpha(P;Q)$ is non-increasing in $\alpha$ over $(0,1)$ (resp. $(1,\infty)$). By Theorem 1, $H_\alpha(P)$ is the maximum (resp. minimum) of $G_\alpha(P;Q)$ taken over $Q$, hence it is also non-increasing in $\alpha$ over $(0,1)$ (resp. $(1,\infty)$). To order the two regions, we note that for $\alpha < 1$

$$H_\alpha(P) \ge G_\alpha(P;P) = \frac{\alpha}{\alpha-1} D(P\|P) + H(P) = H(P)$$

and similarly for $\alpha > 1$ we have $H_\alpha(P) \le H(P)$.

^7 Taking the last max over all $P' \in \mathcal{P}(\mathcal{X})$ changes nothing.

2. $H_\alpha(P)$ is concave in $P$ for $\alpha < 1$.

Proof: $H(Q)$ is concave in $Q$ and $D(Q\|P)$ is convex in $(P,Q)$, hence $G_\alpha(P;Q)$ is concave in $(P,Q)$ for $\alpha < 1$. The statement follows since maximizing a concave function over a convex set ($\mathcal{P}(S(P))$ in this case) preserves concavity.

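3. $H_0(P) = \log|S(P)|$.

Proof: Let $Q'$ be the uniform distribution over $S(P)$. Then on the one hand,

$$H_0(P) \ge \lim_{\alpha\to 0} \left( \frac{\alpha}{\alpha-1} D(Q'\|P) + H(Q') \right) = H(Q') = \log|S(P)|$$

and on the other hand,

$$H_0(P) = \lim_{\alpha\to 0} \max_{Q\ll P} \left\{ \frac{\alpha}{\alpha-1} D(Q\|P) + H(Q) \right\} \le \max_{Q\ll P} H(Q) = \log|S(P)|.$$

4. $H_\infty(P) = -\log\max_{x\in\mathcal{X}} P(x)$.

Proof: Let $Q'(x') = 1$, where $x' \in \mathcal{X}$ satisfies $P(x') = \max_{x\in\mathcal{X}} P(x)$. Then on the one hand,

$$H_\infty(P) \le \lim_{\alpha\to\infty} \left( \frac{\alpha}{\alpha-1} D(Q'\|P) + H(Q') \right) = D(Q'\|P) = -\log P(x') = -\log\max_{x} P(x)$$

and on the other hand,

$$\begin{aligned}
H_\infty(P) &\ge \lim_{\alpha\to\infty} \left( \min_{Q\ll P} \big\{ D(Q\|P) + H(Q) \big\} + \min_{Q\ll P} \frac{1}{\alpha-1} D(Q\|P) \right) \\
&= \min_{Q\ll P} \big\{ D(Q\|P) + H(Q) \big\} \\
&= -\log\max_{x} P(x).
\end{aligned}$$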

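5. $H_1(P) = H(P)$.

Proof: We consider the limit $\alpha \to 1^+$; the other limit follows similarly and coincides. We have already seen that for $\alpha > 1$, $H_\alpha(P) \le G_\alpha(P;P) = H(P)$. Intuitively, $Q = P$ must be set in $G_\alpha$ as above, since otherwise the divergence term blows up. Precisely, fix some $r > H(P)$ and define $M_\alpha = \{Q : \frac{\alpha}{\alpha-1} D(Q\|P) < r\}$. Then

$$\lim_{\alpha\to 1^+} H_\alpha(P) = \lim_{\alpha\to 1^+} \inf_{Q\in M_\alpha} \left\{ \frac{\alpha}{\alpha-1} D(Q\|P) + H(Q) \right\} \ge \lim_{\alpha\to 1^+} \inf_{Q\in M_\alpha} H(Q) = H(P)$$

where the last equality holds since $\sup_{Q\in M_\alpha} D(Q\|P) \to 0$ as $\alpha \to 1^+$.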
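Properties 1 and 3 to 5 above can be observed numerically on a small example. This is a sketch with an assumed distribution and helper name, using base-2 logarithms; it evaluates $H_\alpha(P)$ on a grid of orders and compares the extreme and near-unit orders with $\log|S(P)|$, $-\log\max_x P(x)$ and $H(P)$.

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(P) in bits, for alpha > 0, alpha != 1."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)

P = [0.5, 0.3, 0.2]   # hypothetical example distribution
alphas = [0.01, 0.5, 0.999, 1.001, 2.0, 10.0, 200.0]
values = [renyi_entropy(P, a) for a in alphas]

shannon = -sum(x * math.log2(x) for x in P)
log_support = math.log2(len([x for x in P if x > 0]))   # log |S(P)|
min_entropy = -math.log2(max(P))                        # -log max_x P(x)

# Property 1: the sequence of values should be non-increasing in alpha.
is_non_increasing = all(values[i] >= values[i + 1] - 1e-12
                        for i in range(len(values) - 1))
```

On this example the value at $\alpha = 0.01$ is close to `log_support`, the values on either side of $\alpha = 1$ bracket `shannon`, and the value at $\alpha = 200$ is close to `min_entropy`.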



6. The general inequality $H_\alpha(P) \le \frac{\alpha}{\alpha-1} D(Q\|P) + H(Q)$ for $\alpha > 1$ and $Q \ll P$ (and its reversed counterpart for $\alpha < 1$) is equivalent to the log-sum inequality. Specifically, a uniform $Q$ corresponds to the arithmetic-geometric mean inequality.

Proof: By direct computation.

7. Let $\ell : \mathcal{X} \to \mathbb{N}$ be a codelength assignment associated with some uniquely decodable code for $P$. Define the exponentially weighted average codelength with parameter $\lambda > 0$ associated with $(P,\ell)$ to be^8

$$L_\lambda(P,\ell) \stackrel{\mathrm{def}}{=} \frac{1}{\lambda}\log\sum_{x\in\mathcal{X}} P(x)\,2^{\lambda\ell(x)}. \qquad (13)$$

Then the optimal codelength satisfies:

$$H_{\frac{1}{1+\lambda}}(P) \le \min_{\ell} L_\lambda(P,\ell) < H_{\frac{1}{1+\lambda}}(P) + 1.$$

Proof: We reestablish this result from [3] via our approach. Define the probability distribution $R(x) \stackrel{\mathrm{def}}{=} 2^{-\ell(x)}/c$, where $c = \sum_x 2^{-\ell(x)} \le 1$ by Kraft's inequality. Then

$$L_\lambda(P,\ell) = -\log c + \frac{1}{\lambda}\log\sum_{x\in\mathcal{X}} P(x)R(x)^{-\lambda}. \qquad (14)$$

Let $L_\lambda(P,R)$ be the second summand above. When minimizing over all distributions $R$, it is clearly sufficient to take the infimum over those with $S(R) = S(P)$, which for brevity will be implicit below. Hence:

$$\begin{aligned}
\min_{R\in\mathcal{P}(\mathcal{X})} L_\lambda(P,R) &= \inf_{R} L_\lambda(P,R) \\
&\stackrel{(a)}{=} \inf_{R} \max_{Q} \big\{ -\lambda^{-1} D(Q\|P) + D(Q\|R) + H(Q) \big\} \\
&\stackrel{(b)}{=} \max_{Q} \inf_{R} \big\{ -\lambda^{-1} D(Q\|P) + D(Q\|R) + H(Q) \big\} \\
&= \max_{Q} \big\{ -\lambda^{-1} D(Q\|P) + H(Q) \big\} \stackrel{(c)}{=} H_{\frac{1}{1+\lambda}}(P).
\end{aligned}$$

The equalities are justified as follows:

(a) on account of (10), by setting $\alpha = 1$ and $\beta = -\lambda$.

(b) the objective function is concave^9 and continuous in $Q$ over the compact set $\mathcal{P}(S(P))$ for any fixed $R$, and convex in $R$ for any fixed $Q$. Hence, max and inf can be interchanged [19, Theorem 4.2].

(c) by virtue of Theorem 1.

This immediately establishes the lower bound. The associated saddle point is therefore $(Q^*,Q^*)$, where $Q^*$ is the optimizing distribution for $H_{\frac{1}{1+\lambda}}(P)$, hence $L_\lambda(P,Q^*) = H_{\frac{1}{1+\lambda}}(P)$. Plugging $\ell(x) = \lceil -\log Q^*(x) \rceil$ in (13) establishes the upper bound.
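The two-sided bound in Property 7 can be illustrated on a concrete source. The following sketch is an illustration under assumptions made here (the example distribution, $\lambda = 1$, base-2 logarithms, and all function names); it builds the codelengths $\ell(x) = \lceil -\log Q^*(x) \rceil$ from the tilted distribution $Q^* \propto P^{1/(1+\lambda)}$, confirms Kraft's inequality, and compares the exponentially weighted codelength (13) with $H_{1/(1+\lambda)}(P)$.

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(P) in bits, for alpha > 0, alpha != 1."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)

def weighted_length(p, lengths, lam):
    """Exponentially weighted average codelength, as in (13)."""
    return math.log2(sum(px * 2 ** (lam * l) for px, l in zip(p, lengths))) / lam

def tilted_lengths(p, lam):
    """Codelengths ceil(-log2 Q*(x)) with Q* proportional to P^(1/(1+lam))."""
    alpha = 1.0 / (1.0 + lam)
    w = [x ** alpha for x in p]
    z = sum(w)
    return [math.ceil(-math.log2(x / z)) for x in w]

P = [0.5, 0.25, 0.125, 0.125]   # hypothetical source distribution
lam = 1.0
lengths = tilted_lengths(P, lam)
kraft_sum = sum(2.0 ** (-l) for l in lengths)     # at most 1 for a valid code
L = weighted_length(P, lengths, lam)
H = renyi_entropy(P, 1.0 / (1.0 + lam))
```

Since `kraft_sum` does not exceed one, a uniquely decodable code with these lengths exists, and `L` falls in the interval $[H, H+1)$.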

8. The unique optimizing distribution for $G_\alpha(P;Q)$ is

$$Q^*(x) = \frac{P(x)^\alpha}{\sum_{x'\in\mathcal{X}} P(x')^\alpha}.$$

Proof: Verify by substitution that $G_\alpha(P;Q^*) = H_\alpha(P)$. Uniqueness follows from strict convexity (resp. concavity) of $G_\alpha(P;Q)$ in $Q$ over $\mathcal{P}(S(P))$ for $\alpha > 1$ (resp. $\alpha < 1$).

^8 Note that $\lambda \to 0$ yields the usual average codelength criterion, and $\lambda \to \infty$ yields the maximal codelength criterion.

^9 The first summand is concave in $Q$, while the sum of the last two is linear.

9. (Approximate recursivity) Suppose $P'$ is obtained from $P$ by combining the symbols $x_1, x_2$ (with probabilities $P(x_1) = p_1$ and $P(x_2) = p_2$) into a single symbol $x_1$, i.e., $P'(x_1) = p_1 + p_2$ and $P'(x_2) = 0$, while retaining all other probabilities. Then^10

$$H_\alpha(P) = H_\alpha(P') + c\cdot H_\alpha\!\left(\frac{p_1}{p_1+p_2}\right)$$

where $c$ satisfies

$$(p_1^\alpha + p_2^\alpha)\cdot 2^{(\alpha-1)H_\alpha(P)} \le c \le (p_1+p_2)^\alpha\cdot 2^{(\alpha-1)H_\alpha(P')} \qquad (15)$$

for $\alpha > 1$, and the reversed inequalities for $\alpha < 1$. Note that $0 < c < 1$, and $c \to p_1 + p_2$ as $\alpha \to 1$.

Proof: We prove for $\alpha > 1$; the derivation for $\alpha < 1$ is similar with the inequalities reversed. Let $Q^*$ minimize $G_\alpha(P;Q)$, and write $Q^*(x_1) = q_1^*$, $Q^*(x_2) = q_2^*$. Let $Q'$ be obtained from $Q^*$ by combining $x_1, x_2$ as above. Then:

$$\begin{aligned}
H_\alpha(P') &\le G_\alpha(P';Q') \\
&= \frac{\alpha}{\alpha-1}\left[ D(Q^*\|P) - (q_1^*+q_2^*)\, D\!\left(\frac{q_1^*}{q_1^*+q_2^*}\,\Big\|\,\frac{p_1}{p_1+p_2}\right) \right] + H(Q^*) - (q_1^*+q_2^*)\, H\!\left(\frac{q_1^*}{q_1^*+q_2^*}\right) \\
&\le H_\alpha(P) - (q_1^*+q_2^*)\, H_\alpha\!\left(\frac{p_1}{p_1+p_2}\right).
\end{aligned}$$

The recursivity properties of the Shannon entropy and the Kullback-Leibler divergence were used in the equality transition. The last inequality follows by applying Theorem 1 twice, and using the definition of $Q^*$. Appealing to Property IV-A8 above, the lower bound in (15) is established. For the upper bound, let $Q'^*$ minimize $G_\alpha(P';Q)$. Let the distribution $Q$ be obtained from $Q'^*$ by splitting the probability $Q'^*(x_1)$ between $x_1$ and $x_2$ such that $\frac{Q(x_1)}{Q(x_1)+Q(x_2)} = \frac{p_1^\alpha}{p_1^\alpha+p_2^\alpha}$, while retaining all other probabilities. The bound follows by expanding the inequality $H_\alpha(P) \le G_\alpha(P;Q)$ as above, using recursivity, Theorem 1 and Property IV-A8.
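The approximate recursivity property can be probed numerically. The sketch below rests on assumptions made here: the bounds are taken in the form $(p_1^\alpha+p_2^\alpha)\,2^{(\alpha-1)H_\alpha(P)}$ and $(p_1+p_2)^\alpha\,2^{(\alpha-1)H_\alpha(P')}$ as read from (15), the example distribution and function names are hypothetical, and logarithms are base 2. It solves for $c$ directly from the recursion and compares it with the two expressions for one order above and one order below 1.

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(P) in bits, for alpha > 0, alpha != 1 (zero masses are skipped)."""
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)

def recursivity_terms(p, alpha):
    """Return (c, first_bound, second_bound) when merging symbols 0 and 1 of p."""
    p1, p2 = p[0], p[1]
    merged = [p1 + p2, 0.0] + list(p[2:])
    binary = [p1 / (p1 + p2), p2 / (p1 + p2)]
    h_full = renyi_entropy(p, alpha)
    h_merged = renyi_entropy(merged, alpha)
    c = (h_full - h_merged) / renyi_entropy(binary, alpha)
    first = (p1 ** alpha + p2 ** alpha) * 2 ** ((alpha - 1) * h_full)
    second = (p1 + p2) ** alpha * 2 ** ((alpha - 1) * h_merged)
    return c, first, second

P = [0.4, 0.35, 0.25]   # hypothetical example distribution
c_hi, first_hi, second_hi = recursivity_terms(P, 2.0)   # alpha > 1
c_lo, first_lo, second_lo = recursivity_terms(P, 0.5)   # alpha < 1
```

For $\alpha = 2$ the computed $c$ lies between the first and second expressions, and for $\alpha = 0.5$ the ordering is reversed, in line with the statement of the property.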



B. $D_\alpha(P_1\|P_2)$

For convenience, define:

$$G_\alpha(P_1,P_2;Q) \stackrel{\mathrm{def}}{=} \frac{\alpha}{1-\alpha} D(Q\|P_1) + D(Q\|P_2).$$

We will repeatedly use the fact that by Theorem 1, $G_\alpha(P_1,P_2;Q)$ is a lower (resp. upper) bound for $D_\alpha(P_1\|P_2)$, for $\alpha > 1$ (resp. $\alpha < 1$) and any $Q \ll P_1$.
1. $D_\alpha(P_1\|P_2)$ is an increasing function of $\alpha$.

Proof: Similar to Property IV-A1, by noting that $G_\alpha(P_1,P_2;P_1) = D(P_1\|P_2)$.

2. $D_\alpha(P_1\|P_2) \ge 0$ with equality if and only if $P_1 = P_2$.

Proof: For $\alpha < 1$ this follows immediately from Theorem 1 using the same property of $D(P_1\|P_2)$. For $\alpha > 1$ use also the monotonicity property above.

^10 For binary distributions $P = (p, 1-p)$ and $Q = (q, 1-q)$, we write $H_\alpha(p) = H_\alpha(P)$ and $D_\alpha(p\|q) = D_\alpha(P\|Q)$.

3. $D_\alpha(P_1\|P_2)$ is convex in $P_2$ for $\alpha > 1$ and any fixed $P_1$, and is convex in the pair $(P_1,P_2)$ for $\alpha < 1$.

Proof: $D(Q\|P_2)$ is convex in $P_2$ for any fixed $Q$, hence so is $G_\alpha(P_1,P_2;Q)$. The statement for $\alpha > 1$ follows since a pointwise maximum of convex functions is convex. For $\alpha < 1$, the convexity of $D(Q\|P_1)$ in $(Q,P_1)$ and of $D(Q\|P_2)$ in $(Q,P_2)$ implies that $G_\alpha(P_1,P_2;Q)$ is convex in $(P_1,P_2,Q)$. The result now follows since minimizing a convex function over a convex set ($\mathcal{P}(S(P_1))$ in this case) preserves convexity.

4. $D_0(P_1\|P_2) = -\log P_2(S(P_1))$.

Proof: Let $Q'$ be $P_2$ restricted to $S(P_1)$, with the proper normalization. Then on the one hand,

$$D_0(P_1\|P_2) \le \lim_{\alpha\to 0} \left( \frac{\alpha}{1-\alpha} D(Q'\|P_1) + D(Q'\|P_2) \right) = D(Q'\|P_2) = -\log P_2(S(P_1))$$

and on the other hand,

$$D_0(P_1\|P_2) = \lim_{\alpha\to 0} \min_{Q\ll P_1} \left\{ \frac{\alpha}{1-\alpha} D(Q\|P_1) + D(Q\|P_2) \right\} \ge \min_{Q\ll P_1} D(Q\|P_2) = D(Q'\|P_2) = -\log P_2(S(P_1)).$$

5. $D_\infty(P_1\|P_2) = \log\max_{x\in S(P_1)} \frac{P_1(x)}{P_2(x)}$.

Proof: Let $Q'(x') = 1$, where $x' \in \mathcal{X}$ satisfies $P_1(x')/P_2(x') = \max_{x\in S(P_2)} \big(P_1(x)/P_2(x)\big)$. The proof is now similar to that of Property IV-A4.

6. $D_1(P_1\|P_2) = D(P_1\|P_2)$.

Proof: $Q = P_1$ must be set to avoid a blowup of the first divergence term in $G_\alpha(P_1,P_2;Q)$. The proof is similar to that of Property IV-A5.

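7. (Data Processing Inequality) For any pair of distributions $P_1, P_2 \in \mathcal{P}(\mathcal{X})$ and channel $W : \mathcal{X} \to \mathcal{Y}$,

$$D_\alpha(P_1W\|P_2W) \le D_\alpha(P_1\|P_2).$$

Proof: We prove only for $\alpha < 1$.^11 Let $Q^*$ minimize $G_\alpha(P_1,P_2;Q)$. Write:

$$\begin{aligned}
D_\alpha(P_1W\|P_2W) &\le G_\alpha(P_1W,P_2W;Q^*W) \\
&= \frac{\alpha}{1-\alpha} D(Q^*W\|P_1W) + D(Q^*W\|P_2W) \\
&\le \frac{\alpha}{1-\alpha} D(Q^*\|P_1) + D(Q^*\|P_2) = D_\alpha(P_1\|P_2).
\end{aligned}$$

The data processing inequality for the Kullback-Leibler divergence [18] was used in the last inequality.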
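As a numerical illustration of the data processing inequality, the sketch below pushes two distributions through a fixed channel and compares the Renyi divergences before and after, for several orders. The distributions, the channel and the function names are assumptions made for this example, full support is assumed, and logarithms are base 2.

```python
import math

def renyi_divergence(p1, p2, alpha):
    """D_alpha(P1||P2) in bits, assuming both distributions have full support."""
    s = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p1, p2))
    return math.log2(s) / (alpha - 1)

def push_through(p, w):
    """Output distribution PW(y) = sum_x P(x) W(y|x), with w[x][y] = W(y|x)."""
    return [sum(p[x] * w[x][y] for x in range(len(p))) for y in range(len(w[0]))]

# Hypothetical inputs: two ternary distributions and a 3-input, 2-output channel.
P1 = [0.5, 0.3, 0.2]
P2 = [0.2, 0.5, 0.3]
W = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]

results = []
for alpha in [0.3, 0.5, 2.0, 5.0]:
    before = renyi_divergence(P1, P2, alpha)
    after = renyi_divergence(push_through(P1, W), push_through(P2, W), alpha)
    results.append((alpha, before, after))
```

In every row of `results` the divergence after the channel does not exceed the divergence before it, for orders on both sides of 1.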
8. The unique optimizing distribution for $G_\alpha(P_1,P_2;Q)$ is

$$Q^*(x) = \frac{P_1(x)^\alpha P_2(x)^{1-\alpha}}{\sum_{x'\in\mathcal{X}} P_1(x')^\alpha P_2(x')^{1-\alpha}}.$$

Proof: Verify by substitution that $G_\alpha(P_1,P_2;Q^*) = D_\alpha(P_1\|P_2)$. Uniqueness follows from strict concavity (resp. convexity) of $G_\alpha(P_1,P_2;Q)$ in $Q$ over $\mathcal{P}(S(P_1))$ for $\alpha > 1$ (resp. $\alpha < 1$).

^11 This holds for any $\alpha > 0$; however, the case of $\alpha > 1$ does not seem to follow elegantly from our representation, and can be proved directly.

C. $I_\alpha(P,W)$, $K_\alpha(P,W)$ and $C_\alpha(W)$

1. $K_\alpha(P,W) \le I_\alpha(P,W)$ for $\alpha < 1$, and $K_\alpha(P,W) \ge I_\alpha(P,W)$ for $\alpha > 1$.

Proof: Immediate from Theorem 1 by substituting $Q = P$ in the expressions for $K_\alpha(P,W)$.

2. $I_\alpha(P,W) \le H(P)$ and $K_\alpha(P,W) \le H_{\frac{1}{\alpha}}(P)$, with equality if and only if $I(P,W) = H(P)$.

Proof: From (7) we have that for $\alpha > 1$

$$I_\alpha(P,W) \le \max_{V} I(P,V) \le H(P).$$

A necessary and sufficient condition for an equality is $I(P,V) = H(P)$ and $D(V\|W|P) = 0$ for some $V$, implying $P\circ W = P\circ V$, hence the first assertion. Using this inequality in (8), along with the max counterpart of (5), yields

$$K_\alpha(P,W) \le \max_{P'} \left\{ H(P') + \frac{1}{1-\alpha} D(P'\|P) \right\} = H_{\frac{1}{\alpha}}(P)$$

which inherits the same equality condition, hence the second assertion. For $\alpha < 1$, substituting $V = W$ in the min counterpart of (7) yields

$$I_\alpha(P,W) \le I(P,W) \le H(P).$$

If $I(P,W) = H(P)$ then $I(P,V) = H(P)$ for the minimizing $V$, hence $V(\cdot|x) = W(\cdot|x)$ for $x \in S(P)$ is optimal. The other direction is trivial, hence the first assertion. The second assertion follows similarly as above.

3. $I_\alpha(P,W)$ is concave in $P$ for any fixed $W$ and any $\alpha$, and is convex in $W$ for any fixed $P$ and $\alpha < 1$.

Proof: In this case working directly with (3) is much easier. Concavity in $P$ follows as a pointwise minimum of concave (in fact linear) functions in $P$. Convexity in $W$ for $\alpha < 1$ follows (using Property IV-B3) as a minimization of a convex function in $(Q,W)$ over a convex set.

4. For $\alpha > 1$, $K_\alpha(P,W)$ is concave in $P$ for any fixed $W$, and convex in $W$ for any fixed $P$.

Proof: Using (8) and the previous property, concavity in $P$ follows as a maximization of concave functions in $(P,Q)$ over a convex set. Convexity in $W$ follows as a pointwise maximum of convex functions in $W$.

5. $C_\alpha(W) = \max_{P} K_\alpha(P,W)$.

Proof: For $\alpha > 1$ this is immediate from (8). The case of $\alpha < 1$ does not follow simply from our representation, see [6].

6. (Data Processing Inequality) For any distribution $P \in \mathcal{P}(\mathcal{X})$ and channels $W_1 : \mathcal{X} \to \mathcal{Y}$, $W_2 : \mathcal{Y} \to \mathcal{Z}$,

$$I_\alpha(P,W_1W_2) \le I_\alpha(P,W_1), \qquad K_\alpha(P,W_1W_2) \le K_\alpha(P,W_1)$$

where $W_1W_2$ is the concatenation of the channels $W_1$ and $W_2$, i.e., $(W_1W_2)(z|x) \stackrel{\mathrm{def}}{=} \sum_{y} W_2(z|y)W_1(y|x)$.

Proof: Similar to that of Property IV-B7.



V. A Composite Hypothesis Testing Problem 

Suppose two sensors monitor the occurrence of some phenomenon. The sensors may generally have different sampling rates with some ratio $\lambda > 0$, i.e., for each sample provided by Sensor 1, $\lambda$ samples are provided by Sensor 2. When the phenomenon is present, it is observed at Sensor 1 as i.i.d. samples from an unknown distribution $P_1$ in some given family $\mathscr{P}_1 \subseteq \mathcal{P}(\mathcal{X})$, and at Sensor 2 as i.i.d. samples from an unknown distribution $P_2$ in some given family $\mathscr{P}_2 \subseteq \mathcal{P}(\mathcal{X})$. When the phenomenon is absent, both sensors observe i.i.d. samples from a common unknown "ambient noise" distribution $Q$ in some given family $\mathscr{Q} \subseteq \mathcal{P}(\mathcal{X})$. The samples obtained from the sensors are assumed to be mutually independent under each hypothesis.

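Suppose we are given $n$ samples from the two sensors together, where the first $n_1$ samples are from Sensor 1, and the last $n_2 = \lambda n_1$ samples^12 are from Sensor 2. A decision rule corresponds to a set $\Omega_n \subseteq \mathcal{X}^n$, which is allowed to be a function of the families $\mathscr{P}_1, \mathscr{P}_2, \mathscr{Q}$, but not of the actual $(P_1,P_2,Q)$. The decision rule declares "phenomena" if the sample vector lies in $\Omega_n$, and "no phenomena" otherwise. The miss-detection and false-alarm error probabilities associated with $\Omega_n$ and a triplet $(P_1,P_2,Q)$ are

$$p_{\mathrm{MD}}(\Omega_n|P_1,P_2) \stackrel{\mathrm{def}}{=} P^{(n)}(\mathcal{X}^n\setminus\Omega_n), \qquad p_{\mathrm{FA}}(\Omega_n|Q) \stackrel{\mathrm{def}}{=} Q^n(\Omega_n)$$

where $P^{(n)} \stackrel{\mathrm{def}}{=} P_1^{n_1}\times P_2^{n_2}$. The miss-detection exponent associated with a sequence $\Omega = \{\Omega_n\}_{n=1}^{\infty}$ of decision rules is

$$E_{\mathrm{MD}}(\Omega|P_1,P_2) \stackrel{\mathrm{def}}{=} \liminf_{n\to\infty} -\frac{1}{n}\log p_{\mathrm{MD}}(\Omega_n|P_1,P_2).$$

We will be interested here in maximizing the worst-case miss-detection exponent while guaranteeing a vanishing false-alarm probability, over all feasible $(P_1,P_2,Q)$. Namely, we will consider

$$E^*_{\mathrm{MD}} \stackrel{\mathrm{def}}{=} \sup_{\Omega\in\mathcal{F}}\; \inf_{P_1\in\mathscr{P}_1,\,P_2\in\mathscr{P}_2} E_{\mathrm{MD}}(\Omega|P_1,P_2)$$

where

$$\mathcal{F} = \Big\{ \Omega : \lim_{n\to\infty} p_{\mathrm{FA}}(\Omega_n|Q) = 0,\ \forall Q\in\mathscr{Q} \Big\}.$$

In what follows, let $\delta_n \stackrel{\mathrm{def}}{=} \frac{|\mathcal{X}|}{n}\log n$. For families $\mathscr{P}, \mathscr{P}' \subseteq \mathcal{P}(\mathcal{X})$, define

$$D_\alpha(\mathscr{P}\|\mathscr{P}') \stackrel{\mathrm{def}}{=} \inf_{P\in\mathscr{P},\,P'\in\mathscr{P}'} D_\alpha(P\|P'). \qquad (16)$$

Furthermore, write $\mathscr{Q}^*$ for the closure of the family of all distributions of the form

$$Q^*(x) = \frac{P_1(x)^{\frac{1}{1+\lambda}}\, P_2(x)^{\frac{\lambda}{1+\lambda}}}{\sum_{x'\in\mathcal{X}} P_1(x')^{\frac{1}{1+\lambda}}\, P_2(x')^{\frac{\lambda}{1+\lambda}}}$$

for some $P_1 \in \mathscr{P}_1$, $P_2 \in \mathscr{P}_2$.

^12 For brevity, we disregard integer issues.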



Example 1: The case where $\lambda = 0$ (single sensor) corresponds to a classical setting of composite hypothesis testing. It is well known that in this case [20]

$$E^*_{\mathrm{MD}} = D(\mathscr{Q}\|\mathscr{P}_1)$$

which can be achieved by the decision rule

$$\Omega_n = \Big\{ x^n : \inf_{Q\in\mathscr{Q}} D(\pi_{x^n}\|Q) > \delta_n \Big\}. \qquad (17)$$

Example 2: If $\mathscr{P}_1 \cap \mathscr{P}_2 \cap \mathscr{Q} \neq \emptyset$, then $E^*_{\mathrm{MD}} = 0$ for any $\lambda$.

Example 3: Suppose $\mathscr{P}_1$ and $\mathscr{P}_2$ have disjoint supports, i.e., $S(P_1) \cap S(P_2) = \emptyset$ for all $P_1 \in \mathscr{P}_1$ and $P_2 \in \mathscr{P}_2$. Then $E^*_{\mathrm{MD}} = \infty$ regardless of $\mathscr{Q}$. This is achieved by a simple decision rule that declares "phenomena" when the empirical supports of the samples from the sensors are disjoint, and "no phenomena" otherwise. Clearly, this rule has a zero miss-detection probability for any $n$. It is also easy to see that its false-alarm probability tends to zero exponentially for any $Q \in \mathcal{P}(\mathcal{X})$.

Generally, one would expect the optimal miss-detection exponent to be related to some measure of disparity between the families $\mathscr{P}_1$ and $\mathscr{P}_2$, quantifying the fact that the noise $Q$ cannot mimic both $P_1$ and $P_2$ too well at the same time. As it turns out, at least in the worst-case sense over the choice of $\mathscr{Q}$, this measure is related to a Renyi divergence between the two families.

Theorem 2: For any choice of $\mathscr{P}_1, \mathscr{P}_2, \mathscr{Q}$ and $\lambda$,

$$E^*_{\mathrm{MD}} \ge \lambda(1+\lambda)^{-1}\, D_{\frac{1}{1+\lambda}}(\mathscr{P}_1\|\mathscr{P}_2)$$

with equality if and only if the closure of $\mathscr{Q}$ has a nonempty intersection with the associated $\mathscr{Q}^*$.
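For singleton families $\mathscr{P}_1 = \{P_1\}$ and $\mathscr{P}_2 = \{P_2\}$, the lower bound of Theorem 2 can be evaluated directly. The sketch below is illustrative and assumes full-support example distributions, $\lambda = 2$, base-2 logarithms, and hypothetical function names. It compares the bound with the weighted Kullback-Leibler cost $(1+\lambda)^{-1}\big(D(Q\|P_1) + \lambda D(Q\|P_2)\big)$ at the tilted distribution $Q^*$ defined above, and at randomly drawn noise distributions.

```python
import math
import random

def kl(q, p):
    """D(q||p) in bits, assuming p has full support."""
    return sum(a * math.log2(a / b) for a, b in zip(q, p) if a > 0)

def renyi_divergence(p1, p2, alpha):
    """D_alpha(P1||P2) in bits, assuming full support."""
    s = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p1, p2))
    return math.log2(s) / (alpha - 1)

def worst_noise(p1, p2, lam):
    """Tilted distribution Q* proportional to P1^(1/(1+lam)) * P2^(lam/(1+lam))."""
    w = [a ** (1 / (1 + lam)) * b ** (lam / (1 + lam)) for a, b in zip(p1, p2)]
    z = sum(w)
    return [x / z for x in w]

def weighted_cost(q, p1, p2, lam):
    """(1 + lam)^(-1) * (D(Q||P1) + lam * D(Q||P2))."""
    return (kl(q, p1) + lam * kl(q, p2)) / (1 + lam)

random.seed(1)
P1 = [0.5, 0.3, 0.2]   # hypothetical sensor-1 distribution
P2 = [0.2, 0.5, 0.3]   # hypothetical sensor-2 distribution
lam = 2.0
bound = lam / (1 + lam) * renyi_divergence(P1, P2, 1 / (1 + lam))
at_worst_noise = weighted_cost(worst_noise(P1, P2, lam), P1, P2, lam)

random_costs = []
for _ in range(200):
    w = [random.random() + 1e-3 for _ in P1]
    z = sum(w)
    random_costs.append(weighted_cost([x / z for x in w], P1, P2, lam))
```

The cost at $Q^*$ coincides with the bound, and no randomly drawn noise distribution gives a smaller cost, which is the sense in which $Q^*$ acts as the "worst noise".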

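Proof: Consider first the case where $\mathscr{Q} = \{Q\}$. Let us show that

$$E^*_{\mathrm{MD}} = (1+\lambda)^{-1}\big( D(Q\|\mathscr{P}_1) + \lambda D(Q\|\mathscr{P}_2) \big).$$

Achievability follows by letting $\{\Omega_{n_1}\}$ and $\{\Omega_{n_2}\}$ be the optimal per-sensor decision rules as in (17), and setting

$$\Omega_n = \big\{ (x^{n_1}, y^{n_2}) : x^{n_1} \in \Omega_{n_1} \text{ or } y^{n_2} \in \Omega_{n_2} \big\}. \qquad (18)$$

The converse is a simple generalization of the standard single-sensor case [20]. Let $\Omega' = \{\Omega'_n\}$ be any sequence of decision rules achieving a vanishing false-alarm probability. For $i \in \{1,2\}$, let $\Gamma_{n_i}$ denote the union of all $n_i$-dimensional type classes $T_{Q'}$, where $Q' \in \mathcal{P}^{n_i}(\mathcal{X})$ satisfies $D(Q'\|Q) \le \delta_{n_i}$.

By Lemma 1 property (iv) we have $Q^n(\Gamma_{n_1}\times\Gamma_{n_2}) \to 1$ as $n\to\infty$. Since by our assumption $Q^n(\mathcal{X}^n\setminus\Omega'_n) \to 1$, then $Q^n\big((\Gamma_{n_1}\times\Gamma_{n_2})\setminus\Omega'_n\big) \ge \frac{1}{2}$ (say) for $n$ large enough. Thus, there must exist a pair of types $(Q_{1,n},Q_{2,n})$ with $T_{Q_{1,n}} \subseteq \Gamma_{n_1}$ and $T_{Q_{2,n}} \subseteq \Gamma_{n_2}$ such that $Q^n\big((T_{Q_{1,n}}\times T_{Q_{2,n}})\setminus\Omega'_n\big) \ge \frac{1}{2}\,Q^n(T_{Q_{1,n}}\times T_{Q_{2,n}})$. Since both $Q^n$ and $P^{(n)}$ are constant over $T_{Q_{1,n}}\times T_{Q_{2,n}}$, the same inequality holds for $P^{(n)}$. Therefore,

$$\begin{aligned}
-\frac{1}{n}\log P^{(n)}(\mathcal{X}^n\setminus\Omega'_n) &\le -\frac{1}{n}\log P^{(n)}\big((T_{Q_{1,n}}\times T_{Q_{2,n}})\setminus\Omega'_n\big) \\
&\le -\frac{1}{n}\log \tfrac{1}{2}\,P^{(n)}(T_{Q_{1,n}}\times T_{Q_{2,n}}) \\
&\le (1+\lambda)^{-1}\big( D(Q_{1,n}\|P_1) + \lambda D(Q_{2,n}\|P_2) \big) + \frac{1 + 2|\mathcal{X}|\log(n+1)}{n}
\end{aligned}$$

where properties (i)-(iii) of Lemma 1 were used in the last inequality. Letting $n\to\infty$, and recalling that $D(Q_{i,n}\|Q) \to 0$ which implies $D(Q_{i,n}\|P_i) \to D(Q\|P_i)$, the converse follows. As a result, it is now clear that for a general $\mathscr{Q}$

$$E^*_{\mathrm{MD}} \le (1+\lambda)^{-1} \inf_{Q\in\mathscr{Q}} \big( D(Q\|\mathscr{P}_1) + \lambda D(Q\|\mathscr{P}_2) \big). \qquad (19)$$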

The decision rule (18) above (with $\Omega_{n_1}$ and $\Omega_{n_2}$ now taking the infimum over the family $\mathscr{Q}$) will generally fail to achieve the upper bound in (19), and may even not attain a vanishing miss-detection probability. For instance, if $\mathscr{P}_1 = \{P_1\}$, $\mathscr{P}_2 = \{P_2\}$ and $\mathscr{Q} = \{P_1,P_2\}$, then $p_{\mathrm{MD}}(\Omega_n|P_1,P_2) \to 1$, whereas the upper bound (19) is positive if $P_1 \neq P_2$. Clearly, the problem is that each sensor makes its own binary decision before those are combined, not taking into account that $Q$ is common. This shortcoming is easily corrected by the following modified decision rule:

$$\Omega_n = \Big\{ (x^{n_1}, y^{n_2}) : \inf_{Q\in\mathscr{Q}} \max\big\{ D(\pi_{x^{n_1}}\|Q),\, D(\pi_{y^{n_2}}\|Q) \big\} > \delta'_n \Big\},$$

where $\delta'_n = \max(\delta_{n_1}, \delta_{n_2})$.
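The modified decision rule can be sketched for the special case of a finite noise family, where the infimum becomes a minimum. Everything concrete below is an assumption made for illustration: the function names, the binary alphabet, the sample sequences, the finite family, base-2 logarithms, and the threshold $\delta_n = |\mathcal{X}|\log(n)/n$ as read from the false-alarm bound that follows in the proof.

```python
import math
from collections import Counter

def kl(q, p):
    """D(q||p) in bits; infinite unless the support of q lies inside that of p."""
    total = 0.0
    for a, b in zip(q, p):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log2(a / b)
    return total

def type_of(seq, alphabet):
    """Empirical distribution (type) of a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]

def declares_phenomena(x, y, alphabet, noise_family):
    """True if no single noise candidate explains both sample sets."""
    k = len(alphabet)
    delta = max(k * math.log2(len(x)) / len(x), k * math.log2(len(y)) / len(y))
    tx, ty = type_of(x, alphabet), type_of(y, alphabet)
    statistic = min(max(kl(tx, q), kl(ty, q)) for q in noise_family)
    return statistic > delta

alphabet = "ab"
x = "a" * 90 + "b" * 10      # hypothetical Sensor 1 samples
y = "a" * 20 + "b" * 80      # hypothetical Sensor 2 samples
# The noise family contains candidates close to each sensor's own statistics,
# yet no single candidate is close to both.
family = [[0.9, 0.1], [0.2, 0.8]]
joint_decision = declares_phenomena(x, y, alphabet, family)
noise_like = declares_phenomena("ab" * 50, "ab" * 50, alphabet, [[0.5, 0.5]])
```

In the first case each sensor's type is matched by some member of the family, so per-sensor tests as in (18) would not raise a flag, but the joint statistic exceeds the threshold because the noise must be common to both sensors. In the second case both types match the single noise candidate and the rule declares "no phenomena".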

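Let us show that this rule attains the upper bound in (19). For any $Q \in \mathscr{Q}$, $\Omega_n$ is contained in the set of all vectors $(x^{n_1}, y^{n_2})$ for which either $D(\pi_{x^{n_1}}\|Q) > \delta'_n$ or $D(\pi_{y^{n_2}}\|Q) > \delta'_n$. Thus, using Lemma 1 property (iv) together with the union bound, we obtain

$$\begin{aligned}
p_{\mathrm{FA}}(\Omega_n|Q) &\le |\mathcal{P}^{n_1}(\mathcal{X})|\,2^{-n_1\delta'_n} + |\mathcal{P}^{n_2}(\mathcal{X})|\,2^{-n_2\delta'_n} \\
&\le \binom{n_1+|\mathcal{X}|-1}{|\mathcal{X}|-1}\, n_1^{-|\mathcal{X}|} + \binom{n_2+|\mathcal{X}|-1}{|\mathcal{X}|-1}\, n_2^{-|\mathcal{X}|}
\end{aligned}$$

hence $p_{\mathrm{FA}}(\Omega_n|Q) \to 0$ as $n\to\infty$, for any $Q \in \mathscr{Q}$.

Define the set $\Pi_n \subseteq \mathcal{P}^{n_1}(\mathcal{X})\times\mathcal{P}^{n_2}(\mathcal{X})$ of all the type pairs $(Q_1,Q_2)$ for which there exists some $Q \in \mathscr{Q}$ such that $D(Q_1\|Q) \le \delta'_n$ and $D(Q_2\|Q) \le \delta'_n$. By definition, $\mathcal{X}^n\setminus\Omega_n$ is a union of all type class products pertaining to $\Pi_n$. Therefore, using properties (i)-(iii) of Lemma 1 again, we get

$$\begin{aligned}
-\frac{1}{n}\log p_{\mathrm{MD}}(\Omega_n|P_1,P_2) &= -\frac{1}{n}\log \sum_{(Q_1,Q_2)\in\Pi_n} P^{(n)}(T_{Q_1}\times T_{Q_2}) \\
&\ge (1+\lambda)^{-1} \min_{(Q_1,Q_2)\in\Pi_n} \big( D(Q_1\|P_1) + \lambda D(Q_2\|P_2) \big) - \frac{2|\mathcal{X}|\log(n+1)}{n}.
\end{aligned}$$

Let $(Q_{1,n},Q_{2,n})$ achieve the minimum above. Then by definition there exists $Q_n \in \mathscr{Q}$ such that $D(Q_{i,n}\|Q_n) \le \delta'_n \to 0$ for $i \in \{1,2\}$, which implies that $D(Q_{i,n}\|P_i) - D(Q_n\|P_i) \to 0$. Hence for any $P_1 \in \mathscr{P}_1$, $P_2 \in \mathscr{P}_2$,

$$E_{\mathrm{MD}}(\Omega|P_1,P_2) \ge (1+\lambda)^{-1} \inf_{Q\in\mathscr{Q}} \big( D(Q\|P_1) + \lambda D(Q\|P_2) \big).$$

Therefore, $\Omega$ attains the upper bound in (19), and thus

$$\begin{aligned}
E^*_{\mathrm{MD}} &= (1+\lambda)^{-1} \inf_{Q\in\mathscr{Q}} \big( D(Q\|\mathscr{P}_1) + \lambda D(Q\|\mathscr{P}_2) \big) \qquad (20) \\
&\ge (1+\lambda)^{-1} \min_{Q\in\mathcal{P}(\mathcal{X})} \big( D(Q\|\mathscr{P}_1) + \lambda D(Q\|\mathscr{P}_2) \big) \\
&= \lambda(1+\lambda)^{-1}\, D_{\frac{1}{1+\lambda}}(\mathscr{P}_1\|\mathscr{P}_2)
\end{aligned}$$

where the last step is on account of Theorem 1.^13 Property IV-B8 verifies the necessary and sufficient conditions for an equality. ■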
The lower bound in Theorem 2 is independent of the noise family $\mathscr{Q}$, hence the Renyi divergence between the families $\mathscr{P}_1$ and $\mathscr{P}_2$ admits an operational interpretation as the optimal worst-case miss-detection exponent (up to a constant) when the noise distribution $Q$ is completely unknown (i.e., $\mathscr{Q} = \mathcal{P}(\mathcal{X})$), or more generally, when $Q$ can take values in the "worst noise" set $\mathscr{Q}^*$. In other cases this serves only as a lower bound, and the strictly larger exponent is given by (20). It is possible (somewhat artificially) to interpret this exponent as a (limit of a) generalized form of the Renyi divergence, taking into account also the family $\mathscr{Q}$, as we now proceed to show.

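Let $(\alpha_1,\ldots,\alpha_{k+1})$ be a probability vector, and write $\boldsymbol{\alpha} \stackrel{\mathrm{def}}{=} (\alpha_1,\ldots,\alpha_k)$. Let $\{P_1,\ldots,P_{k+1}\}$ be distributions in $\mathcal{P}(\mathcal{X})$. We define the associated generalized Renyi divergence of order $\boldsymbol{\alpha}$ to be

$$D_{\boldsymbol{\alpha}}(P_1,\ldots,P_{k+1}) \stackrel{\mathrm{def}}{=} -\log\left( \sum_{x\in\mathcal{X}} \prod_{j=1}^{k+1} P_j(x)^{\alpha_j} \right).$$

For families of distributions $\{\mathscr{P}_1,\ldots,\mathscr{P}_{k+1}\}$, we define

$$D_{\boldsymbol{\alpha}}(\mathscr{P}_1,\ldots,\mathscr{P}_{k+1}) \stackrel{\mathrm{def}}{=} \inf_{P_j\in\mathscr{P}_j} D_{\boldsymbol{\alpha}}(P_1,\ldots,P_{k+1}).$$

Additivity of the generalized Renyi divergence is easily verified, which leads to

Corollary 1:

$$D_{\boldsymbol{\alpha}}(P_1,\ldots,P_{k+1}) = \min_{Q} \sum_{j=1}^{k+1} \alpha_j D(Q\|P_j).$$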
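Corollary 1 can be checked numerically for full-support distributions. The sketch below is an illustration only; the three example distributions, the weight vector, and the helper names are assumptions, and logarithms are base 2. It uses the candidate minimizer $Q^* \propto \prod_j P_j^{\alpha_j}$, for which the weighted sum of divergences reduces to the generalized Renyi divergence when the weights sum to one.

```python
import math
import random

def kl(q, p):
    """D(q||p) in bits, assuming p has full support."""
    return sum(a * math.log2(a / b) for a, b in zip(q, p) if a > 0)

def generalized_renyi(dists, weights):
    """-log2 sum_x prod_j P_j(x)^alpha_j, for weights that sum to one."""
    total = 0.0
    for x in range(len(dists[0])):
        total += math.prod(p[x] ** a for p, a in zip(dists, weights))
    return -math.log2(total)

def geometric_mixture(dists, weights):
    """Normalized product prod_j P_j^alpha_j (candidate minimizer)."""
    w = [math.prod(p[x] ** a for p, a in zip(dists, weights))
         for x in range(len(dists[0]))]
    z = sum(w)
    return [v / z for v in w]

def weighted_kl(q, dists, weights):
    """sum_j alpha_j D(Q||P_j)."""
    return sum(a * kl(q, p) for p, a in zip(dists, weights))

random.seed(2)
dists = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]  # hypothetical
weights = [0.2, 0.3, 0.5]                                     # sums to one
value = generalized_renyi(dists, weights)
at_mixture = weighted_kl(geometric_mixture(dists, weights), dists, weights)

random_values = []
for _ in range(200):
    w = [random.random() + 1e-3 for _ in range(3)]
    z = sum(w)
    random_values.append(weighted_kl([v / z for v in w], dists, weights))
```

The weighted divergence at the geometric mixture matches `value`, and the random candidates stay above it.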

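Theorem 3: For any $0 < \alpha < (1+\lambda)^{-1}$,

$$E^*_{\mathrm{MD}} \ge (1+\lambda)^{-1}\alpha^{-1}\, D_{(\alpha,\lambda\alpha)}(\mathscr{P}_1,\mathscr{P}_2,\mathscr{Q}) \stackrel{\mathrm{def}}{=} E^*_{\mathrm{MD}}(\alpha).$$

Furthermore, $E^*_{\mathrm{MD}}(\alpha)$ is monotonically non-increasing in $\alpha$, and if $E^*_{\mathrm{MD}} < \infty$ then

$$\lim_{\alpha\to 0^+} E^*_{\mathrm{MD}}(\alpha) = E^*_{\mathrm{MD}}.$$

Proof:

$$\begin{aligned}
E^*_{\mathrm{MD}} &= (1+\lambda)^{-1} \inf_{Q\in\mathscr{Q}} \big( D(Q\|\mathscr{P}_1) + \lambda D(Q\|\mathscr{P}_2) \big) \\
&\ge (1+\lambda)^{-1} \inf_{P_1\in\mathscr{P}_1,\,P_2\in\mathscr{P}_2,\,Q\in\mathscr{Q}}\; \min_{Q'} \Big( D(Q'\|P_1) + \lambda D(Q'\|P_2) + \big(\alpha^{-1} - (\lambda+1)\big) D(Q'\|Q) \Big) \\
&= (1+\lambda)^{-1}\alpha^{-1}\, D_{(\alpha,\lambda\alpha)}(\mathscr{P}_1,\mathscr{P}_2,\mathscr{Q}).
\end{aligned}$$

Monotonicity is clear from the second line above. Tightness in the limit is proved in a similar way to Property IV-A5, by noting that $E^*_{\mathrm{MD}} < \infty$ implies $D(Q'\|Q) \to 0$ as $\alpha \to 0$ for the optimizing $Q'$. ■

^13 Note that for the $\alpha < 1$ counterpart of (6), minimizing over $Q \in \mathcal{P}(\mathcal{X})$ instead of $Q \ll P_1$ changes nothing.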